Popular is Cheaper: Curtailing Memory Costs in Interactive Analytics Engines

Authors

Mainak Ghosh

Ashwini Raina

Le Xu

Xiaoyao Qian

Indranil Gupta

Himanshu Gupta

Abstract

This paper targets the growing area of interactive data analytics engines. We present a system called Getafix that intelligently decides replication levels and replica placement for data segments, in a way that is responsive to changing popularity of data access by incoming queries. We present an optimal solution to the static version of the problem, achieving minimality in both makespan and replication factor. Based on this intuition we build the Getafix system to handle queries and segments arriving in real time. We integrated Getafix into Druid, a modern open-source interactive data analytics engine. We present experimental results using workloads from Yahoo!’s production Druid cluster. Compared to existing work, Getafix achieves comparable query latency (both average and tail), while using 1.45-2.15× less memory in a private cloud. In a public cloud, for a 100 TB hot dataset size, Getafix can cut dollar costs by as much as 10 million annually with negligible performance impact.

Similar Resources

DimmWitted: A Study of Main-Memory Statistical Analytics

We perform the first study of the tradeoff space of access methods and replication to support statistical analytics using first-order methods executed in the main memory of a Non-Uniform Memory Access (NUMA) machine. Statistical analytics systems differ from conventional SQL-analytics in the amount and types of memory incoherence that they can tolerate. Our goal is to understand tradeoffs in ac...

Graph Analytics on Relational Databases

Graph analytics has become increasingly popular in recent years. Conventionally, data is stored in relational databases that have been refined over decades, resulting in highly optimized data processing engines. However, the awkwardness of expressing iterative queries in SQL makes the relational query processing model inadequate for graph analytics, leading to many alternative solutions. Our r...

Real-Time Analytics as the Killer Application for Processing-In-Memory

While Processing-In-Memory (PIM) has been widely researched for the last two decades, it was never truly adopted by the industry and remains mostly within the academic research realm. This is mainly because (1) in-memory compute engines were too slow, and (2) a real-world application that could really benefit from PIM was never identified. In recent years, the first argument became untenable, but...

Design and Implementation of a Real-Time Interactive Analytics System for Large Spatio-Temporal Data

In real-time interactive data analytics, the user expects to receive the results of each query within a short time period such as seconds. This is especially challenging when the data is big (e.g., on the scale of petabytes), and the analytics system runs on top of cloud infrastructure (e.g., thousands of interconnected commodity servers). We have been building such a system, called OceanRT, fo...

Getafix: Workload-aware Distributed Interactive Analytics

Distributed interactive analytics engines (Druid, Redshift, Pinot) need to achieve low query latency while using the least storage space. This paper presents a solution to the problem of replication of data blocks and routing of queries. Our techniques decide the replication level of individual data blocks (based on popularity, access counts), as well as output optimal placement patterns for su...

Publication date: 2018
